We report here a first draft of a genome's transcriptional regulatory code, the set
of sequences utilized by DNA-binding regulators to control genome expression 
programs. The code was derived by combining data on genomic binding locations 
for transcriptional regulators in yeast cells grown in multiple environmental 
conditions, knowledge of genome sequence conservation and prior evidence for 
regulator-DNA interactions. We discuss new insights into global transcriptional 
regulation that are revealed by the code, including the organization of regulatory 
elements in promoters and the environment-dependent use of these elements by 
regulators. We find that environment-specific use of the regulatory code predicts 
mechanistic models for the function of a large population of yeast's transcriptional 
regulators. 

Genome sequences contain information necessary to control gene expression programs
and specify protein and other gene products. DNA-binding transcriptional regulators
interpret the genome's regulatory code by binding to specific sequences to induce or
repress gene expression 1-3. Substantial portions of genome sequence are believed to be
regulatory 4-8, but the DNA sequences that actually contribute to the regulatory code are
ill-defined. In contrast, the triplet code used to translate nucleotide sequences into
protein molecules is well known 9-11. Knowledge of the genome's transcriptional
regulatory code could provide new insights into the principles that govern global gene 
regulation. 

Comparative genomics has recently been used to identify functional sequence 
elements in the yeast genome 4,5,12-14. Comparative analysis of the genome sequences of
multiple yeast species revealed phylogenetically conserved sequences, and these 
sequences were used to facilitate identification of genes and putative regulatory 
elements. Conserved sequence information alone does not reveal, however, the subset 
of sequences that are bound by transcriptional regulators, the identity of the binding 
regulators, or the conditions under which the regulators occupy their binding sites. 

The set of DNA sequences that play key roles in transcriptional regulation can be 
deduced by identifying the genomic binding locations for individual regulatory proteins
15-20 and by searching for DNA sequences shared by the bound sites using statistical
algorithms 21-23. However, genomic binding information is available only for a subset
of regulators in any organism. The genomic binding sites of 106 transcriptional
regulators have been identified in yeast grown in rich medium 16 , but there are 
approximately 200 transcriptional regulators encoded in the yeast genome, and many of 
these are known to function under conditions other than a rich medium environment. 

Proteome-Genome Interactions

To elucidate a substantial portion of the yeast genome's transcriptional regulatory code, 
we used genome-wide location analysis 15 to determine the genomic occupancy of 203 
DNA-binding transcriptional regulators in rich media conditions and, for 85 of these 
regulators, in at least one of twelve other environmental conditions. The 203 
transcriptional regulators were identified by searching the YPD and MIPS databases 24-26
for known and predicted transcriptional regulators and nucleic acid binding proteins.
These are likely to include nearly all of the DNA-binding transcriptional regulators 
encoded in the yeast genome. We selected regulators for profiling in a specific 
environment if they were essential for growth in that environment or if there was other 
evidence implicating them in regulation of gene expression in that environment (see 
online supporting data). The complete set of transcriptional regulators and the 
conditions under which they were profiled are listed in Supplementary Table S1, and all
genome-wide location datasets are available at 
http://web.wi.mit.edu/young/regulatory_code. 

The genome-wide location data identified 11,000 unique interactions between
regulators and promoter regions. There was a broad distribution in the number of
promoter regions bound by the transcriptional regulators (Figure 1a); the average
regulator bound to approximately 55 promoter regions (P < 0.001). Among the group
of regulators that bound the most promoters, a disproportionate number were nuclear
regulators known to be highly abundant 21, including Abf1, Cbf1, Reb1 and Ste12.
Twenty putative regulators bound no promoter regions at high confidence (P < 0.001), 
suggesting that they function under conditions that were not explored (see below) or 
that they are not genuine transcriptional regulators. The location data also reveal that 
nearly one-half of the genes that are bound by regulators were bound by two or more 
regulators (Supplementary Fig. S2), and that a subset of genes are bound by a very large
number of regulators, suggesting that they are regulated by multiple signals and 
subjected to combinatorial control. 

For the 85 regulators profiled in rich media and at least one other environment, we 
found that a substantial number of promoter regions that were not bound in rich media 
were bound in the other environments. Approximately one thousand regulator-promoter 
region interactions identified in the other environments involved promoters that were 
not bound by any regulator under rich growth conditions. For example, a comparison of 
the promoter regions bound by 34 regulators in cells grown under rich or amino acid 
starved environments (Figure 1b) reveals that there are substantial differences in the
number of promoters bound by many of these regulators. These results underscore the 
importance of the role of environmental conditions in governing DNA-regulator 
interactions, and we explore environment-dependent global gene regulation in more 
detail below. 

DNA Binding Site Sequences 

We combined genome-wide location data, phylogenetically conserved sequence 
information, and prior knowledge of regulator binding sequences to create a database of 
DNA binding sequences for transcriptional regulators (Figure 2a). To accomplish this, 
we first combined genome-wide location and sequence conservation data to predict 
binding site specificities. For regulators where we failed to predict binding site 
specificities at high confidence, we used evidence for binding site sequences obtained 
from the literature, where available. 

Genome-wide location analysis identifies regions of the genome that are 
physically occupied by specific regulators in vivo, but does not identify the precise 
DNA sequences that serve as recognition sites. Numerous algorithms have been 
developed to discover binding sites in a set of identified sequences. We designed a
strategy to improve the yield of the discovery process by combining the results from six
programs, three of which incorporate information from phylogenetic conservation 21-23,28;
for a detailed description of the method, see supplementary data. Using this approach,
we identified binding site sequence specificities for the 151 regulators that bound more
than ten promoter regions. Of these, 68 met a high-threshold criterion for significance
(see Methods and Supplementary Table S2).

We compared the discovered binding site sequence specificities with previous 
knowledge by generating a list of binding sequences and sequence motifs for each 
regulator that was derived solely from the evidence for protein-DNA interactions as 
contained in the databases TRANSFAC, SCPD and YPD 24,25,29,30 (Supplementary
Table S2). For 39 transcriptional regulators, we identified a DNA
binding site sequence that was identical to, or shared significant homology with,
previously described binding sequences. For example, the DNA binding sequence
motifs for Abf1, Gal4, Gcn4, Leu3 and Ste12 were rediscovered (Figure 2b). For 21
additional regulators, the DNA binding site sequence we predict was novel and
had not been described previously. Aft2, Rds1, Snt2,
Stb4 and YDR026C were among the regulators for which novel binding sequences were 
discovered (Figure 2b). For 8 regulators, the DNA binding sequence we predict differs 
from the sequence described in the literature. This discrepancy may be due to different 
growth environments used in the different studies, to noise in the genome-wide location 
data, or to limitations in the analytical methods used here or in previous studies. 

The literature suggests binding site sequences for 15 transcriptional regulators for
which we did not predict a binding site sequence. We added this literature evidence to 
our database of DNA binding sequences for transcriptional regulators (Supplementary 
Table S2) and used it in subsequent analyses. 

We note that the regulators bind to only a fraction of the sites in the genomic
DNA that contain the predicted recognition sequence. To test whether this observation 
is due to the limitations of our methods or whether it reflects a fundamental aspect of 
genomic regulation, we investigated whether bound sites were under more selective 
pressure than sites that were not bound. We used DNA binding site specificities listed 
in the TRANSFAC database in this analysis to eliminate the sequence conservation bias 
inherent in the binding sites specificities determined as described above. Figure 2c 
shows that the genomic loci that both match specificities in the TRANSFAC database 
and were bound in our assay are more conserved in other sensu stricto Saccharomyces 
species than loci that match the TRANSFAC specificities but were not bound. For 
example, the regulator Hsf1 binds to only 23 of the 63 genomic loci that match the
TRANSFAC specificity. While 78% of the bound sequences are conserved, only 13% 
of the remaining sites are conserved. 

There are both positive and negative regulatory mechanisms that are likely to 
account for the observation that DNA binding regulators bind only a subset of the sites 
in genomic DNA that contain that regulator's recognition sequence. Cis-acting 
regulatory DNA sequences frequently contain binding sites for multiple regulators that 
stimulate cooperative binding through protein-protein interactions or that alter local 
DNA structure. The presence of multiple regulators of this type might be expected to be 
conserved in related species, accounting for the relative selective pressure observed for 
the bound sequences. It is also the case that the binding of a protein to certain sites in 
the genome can be occluded by the presence of another protein. The proteins associated 
with genomic DNA include transcriptional regulators, the transcription apparatus, the 
DNA replication apparatus, histones and other chromatin-associated proteins, and 
chromatin modifying complexes. These are all likely to affect the relative occupancy of 
specific DNA sequences by transcriptional regulators. 

We also note that some DNA binding regulators can occupy DNA sequences in
the absence of a discernible binding sequence for that regulator. There are several
examples of proteins that occupy specific sites in the absence of an appropriate
sequence. When Tec1 is present in cells, Ste12 can be found associated with DNA with
Tec1 binding sequences 31. The Hir1/Hir2 corepressor complex, which is recruited to
the promoters of histone genes, does not bind to a sequence that is conserved in each of 
the histone promoters 32 . The Origin of DNA Replication Complex (ORC) occupies 
specific genomic sites 33 , but a consensus sequence that would meet the confidence 
criteria used here has not been found. These latter two examples demonstrate that more 
than one unique DNA sequence can provide a functional binding site for transcriptional 
and DNA synthesis regulators, and emphasize the importance of in vivo binding data in 
discovering functional elements in the genome. 

Transcriptional regulatory code 

Gene expression programs depend on recognition of specific DNA sequences by 
transcriptional regulatory proteins, so the set of DNA sequences that are bound by these 
proteins should reveal the genome's transcriptional regulatory code. We have 
constructed a draft of the yeast genome's transcriptional regulatory code by identifying 
the positions of the sequences that are bound by regulators in vivo in the genome 
sequence. To populate this draft of the regulatory code with high confidence 
information, we further restricted the map to include sequences that were not only 
bound, but were also conserved in at least two other sensu stricto Saccharomyces 
genomes 4. Portions of the draft code are displayed in Figure 3, and the complete map
can be found at the authors' website. Because the information used to construct the map 
includes binding data from many different growth environments, the code describes 
transcriptional regulatory potential within the genome. 

The first level of information contained within the regulatory code is simply the
association of genes with the regulators responsible for their transcriptional control.
This level of information reveals insights into the control of underlying biological
processes. For example, we find that the promoter of BAP2, which encodes an extracellular
amino acid transporter, is bound by the amino acid biosynthetic regulators Gcn4 and
Leu3. Similarly, we identify binding of the regulator of respiration, Hap5, to a site
upstream of COX4, a component of the respiratory electron transport chain. The
ribosomal regulator Rap1 is bound to the small ribosomal subunit component gene
RPS18A. Knowledge of which regulators control which genes is fundamental in 
understanding particular aspects of biology. However, in many cases, the identity of 
regulators and the arrangement of their binding sites within promoters suggest
additional levels of organization (discussed below). 

The distribution of binding sites for transcriptional regulators reveals there are 
constraints on the organization of promoters in the yeast genome (Supplementary Fig.
S1). Binding sites are not uniformly distributed over the promoter regions, but rather
show a sharply peaked distribution. Very few sites are located in the region 80 bp 
upstream of protein coding sequences. This region typically includes the transcription 
start site and is bound by the transcription initiation apparatus; RNA polymerase alone 
has been shown to occupy approximately 60 bp. The vast majority (73%) of the 
transcriptional regulator binding sites lie between 100 and 500 bp upstream of the
protein coding sequence. It appears that yeast transcriptional regulators function better 
at short distances along the linear DNA, a property that reduces the potential for 
inappropriate activation of nearby genes and allows evolution to select for organisms 
that dispense with lengthy intergenic regions. 

Promoter Architectures 

We note that there are specific arrangements of DNA binding site sequences within
promoters, and that these promoter architectures can provide clues to regulatory 
mechanisms. We identify four such distinctive arrangements (Figure 4) and discuss 
their biological implications below. 

Single regulator architecture. The presence of a DNA binding site for a single 
regulator is the simplest promoter architecture (Figure 4). Seven hundred sixty-three 
promoter regions were found to have single regulator promoter architecture (see online 
supporting data). These promoter regions include the binding sites for 75 of the 203 
regulators. We expect that sets of genes containing a binding site for a single regulator 
may be involved in a common function, and this is the case for genes bound by 25 
regulators (Supplementary Table S3). Analysis of expression data confirmed that for 15 
of these regulators, there is also strong correlation between single regulator promoters 
and changes in gene expression (Supplementary Table S3). We do not expect 
correlation in all cases since expression is known to be regulated at levels other than 
transcription initiation. For example, mechanisms are known that control both mRNA 
stability and export from the nucleus 34 . 

Multiple regulator architecture. Promoters with multiple regulator architecture 
contain binding sites for two or more different regulators (Figure 4). We find that 533 
promoter regions contain sequences corresponding to two or more binding site motifs. 
These regions are bound by 99 different regulators. This promoter arrangement implies 
that the gene may be subject to combinatorial regulation, and we expect that in many 
cases the various regulators can be used to execute differential responses to different 
growth conditions. Indeed, we note that many of the genes in this category encode 
products that are required for multiple metabolic pathways and are regulated in an 
environment-specific fashion. For example, the promoter of the IDP1 gene, which is
involved in energy production through oxidation, is bound not only by Hap4 and Rtg3,
regulators of carbon metabolism, but also by Gln3 and Gcn4. These latter proteins
regulate amino acid biosynthetic pathways that require the same simple carbon
compounds generated by genes under the control of Idp1.

Repetitive motif architecture. While some promoter regions contain only a 
single copy of a particular binding site sequence, others contain repeats of these sites 
(Figure 4). Multiple binding sites have been shown to be necessary for stable binding 
by the regulator Dal80 35, and given the small size of typical binding site sequences, this
requirement is likely for additional regulators. This repetitive promoter architecture can
also allow for a graded transcriptional response, as has been observed for the HIS4 gene
36,37. The presence of repeated binding site sequences for a single regulator was
observed for 71 regulators across 408 promoter regions (Figure 4). We note that a
number of regulators, including Dig1, Mbp1, and Swi6, show a statistically significant
preference for repetitive motifs (Supplementary Table S4). 

Co-occurring regulator architecture. We searched the set of multiple regulator 
promoters to identify pairs of transcription factors whose binding sites occur more 
frequently within the same promoter regions than would be expected by chance (Table 
S5). There are ninety-three such co-occurring pairs of regulators (P < 0.005). We 
expect three types of relationships among such regulators. The first type is exemplified 
by the regulatory pair Gcn4 and Leu3. In this case, the pair of regulators share a 
functional overlap in their regulatory roles. Specifically, Gcn4 is a general regulator of 
amino acid biosynthetic genes whereas Leu3 regulates a smaller set of genes involved 
solely in the metabolism of aliphatic amino acids. This arrangement apparently allows 
cells the option of co-ordinately regulating either the complete set of amino acid 
biosynthetic genes or a specific subset. A second type of co-occurring regulator pairs 
consists of regulators that form physical associations. A well studied example of such a 
pair is Swi6 and Mbp1, which together form the MBF heterodimeric cell cycle
regulatory complex. Finally, where pairs of regulators do not interact physically, but
bind to the same DNA sequence, the binding site is polysemic, conveying more than
one meaning. In many cases, such regulator pairs share homologous DNA-binding
domains. The regulators Fkh1 and Fkh2, for example, are known to have overlapping
but distinct functions, and this is reflected in their binding profiles. 
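
As an illustration of how a co-occurrence search of this kind can be scored, the following minimal sketch computes the probability of seeing at least the observed number of shared promoters by chance. It assumes a one-sided hypergeometric test; the text above does not name the statistic that was used, so both the choice of test and the function name `cooccurrence_p` are assumptions, and the counts in the usage example are hypothetical.

```python
from math import comb


def cooccurrence_p(n_total, n_a, n_b, n_both):
    """One-sided hypergeometric P value for the overlap of two promoter sets.

    n_total: number of promoters searched
    n_a:     promoters containing a site for regulator A
    n_b:     promoters containing a site for regulator B
    n_both:  promoters containing sites for both regulators
    Assumption (not from the paper): sites for A and B are placed
    independently under the null model.
    """
    if not (0 <= n_both <= min(n_a, n_b) <= n_total):
        raise ValueError("inconsistent counts")
    upper = min(n_a, n_b)
    # Number of ways to choose B's promoters so that k of them are also A's.
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return favourable / comb(n_total, n_b)


if __name__ == "__main__":
    # Hypothetical counts: 533 multiple-regulator promoters, two regulators
    # with 40 and 25 sites, 12 promoters shared.
    p = cooccurrence_p(533, 40, 25, 12)
    print(f"P = {p:.3g}", "co-occurring" if p < 0.005 else "not significant")
```

A pair would be reported as co-occurring when this P value falls below the 0.005 threshold quoted above; with ninety-three pairs reported, a correction for multiple testing may also be relevant, which the sketch does not attempt.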

Environment-specific Binding of Regulatory Sequences 

By conducting genome-wide binding experiments for some regulators under multiple 
cell growth conditions, we learned that regulator binding to a subset of the regulatory 
sequences is highly dependent on the environmental conditions of the cell. We 
observed four common patterns of regulator binding behaviour (Figure 5, 
Supplementary Table S6). Prior information about the regulatory mechanisms 
employed by well-studied regulators in each of the four groups suggests how to account 
for the environment-dependent binding behaviour of the other regulators. 

"Condition invariant" regulators bind essentially the same set of promoters 
(within the limitations of noise) in two different growth environments (Figure 5). Leu3, 
which is known to regulate genes involved in amino acid biosynthesis, is among the 
best studied of the regulators in this group. Activation of Leu3-regulated genes has 
been shown to be independent of Leu3 binding, but requires association of a leucine
metabolic precursor to convert Leu3 from a negative to a positive regulator 38-40. We note that
other zinc cluster type regulators that show "condition invariant" behaviour are known 
to be regulated in a similar manner. Thus, it is reasonable to propose that the activation 
or repression functions of some of the other regulators in this class will be independent 
of DNA binding. 

"Condition enabled" regulators do not bind the genome detectably under one 
condition, but bind a substantial number of promoters with a change in environment. 
Msn2 is among the best-studied regulators in this class, and the mechanisms involved in
Msn2-dependent transcription provide clues to how the other regulators in that class
may operate. Msn2 is known to be excluded from the nucleus when cells grow in the
absence of stresses, but accumulates rapidly in the nucleus when cells are subjected to
stress 41-43. This condition-enabled behaviour was also observed for the thiamine
biosynthetic regulator Thi2, the nitrogen regulator Gat1 and the developmental regulator
Rim101. We postulate that each of these transcriptional regulators is regulated by
nuclear exclusion or by another mechanism that would cause this extreme version of 
condition-specific binding. 

"Condition expanded" regulators bind to a core set of target promoters under one 
condition, but bind an expanded set of promoters under another condition. Gcn4 is the 
best-studied of the regulators that fall into this "expanded" class. The levels of Gcn4 are 
reported to increase 6-fold when yeast are introduced into media with limiting nutrients 
44, due largely to increased nuclear protein stability 42,45,46, and under this condition we
find Gcn4 binds to an expanded set of genes. Interestingly, the probes bound when 
Gcn4 levels are low contain better matches to the known Gcn4 binding site than probes 
that are bound exclusively at higher protein concentrations, consistent with a simple 
model for specificity based on intrinsic protein affinity and protein concentration 
(Supplementary Figure S1). The expansion of binding sites by many of the regulators
in this class likely reflects increased levels of the regulator available for DNA binding. 

"Condition altered" regulators exhibit altered preference for the set of promoters 
bound in two different conditions. Ste12 is the best studied of the regulators whose
binding behaviour falls into this "altered" class. The specificity of Ste12 is thought to
depend, in part, on its interactions with other regulators whose availability is
environment-dependent. For example, the binding site preference of Ste12 changes
when Tec1 is present, apparently because Tec1 interacts with Ste12 and has its own
DNA-binding site specificity. This condition-altered behaviour was also observed for
the transcriptional regulators Aft2, Pho4, and Ume6. We postulate that the binding 
specificity of many of the transcriptional regulators may be altered through interactions 
with other regulators or through modifications (e.g., chemical) that are environment- 
dependent. 

We note that classification of regulator behaviour is dependent on the 
environmental conditions selected for comparison, and that some regulators fall into 
multiple categories. This is consistent with the understanding that multiple types of 
regulatory mechanisms are often associated with regulators, and differentially modulate 
regulator behaviour. We anticipate that future experiments in additional environments 
will provide a more comprehensive understanding of regulator dynamics. 
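
The four binding patterns described above can be expressed as a simple comparison of the promoter sets bound under two conditions. The sketch below is only illustrative: the function name `classify_binding`, the overlap cutoff of 0.8 and the growth factor of 1.5 are hypothetical values chosen for the example, since the text defines the classes qualitatively and gives no numerical thresholds.

```python
def classify_binding(bound_a, bound_b, overlap=0.8, growth=1.5):
    """Assign one of the four binding patterns to a regulator.

    bound_a, bound_b: promoters bound under two growth conditions.
    overlap, growth:  hypothetical cutoffs (not taken from the paper).
    """
    a, b = set(bound_a), set(bound_b)
    if not a and not b:
        return "unbound"
    small, large = sorted((a, b), key=len)
    if not small:
        # No detectable binding in one condition, binding in the other.
        return "condition enabled"
    shared = len(small & large)
    if shared / len(small | large) >= overlap:
        # Essentially the same promoters in both conditions.
        return "condition invariant"
    if shared / len(small) >= overlap and len(large) >= growth * len(small):
        # A core set is retained and additional promoters are gained.
        return "condition expanded"
    # The preferred promoters differ between the two conditions.
    return "condition altered"


if __name__ == "__main__":
    rich = {"P1", "P2"}
    starved = {"P1", "P2", "P3", "P4", "P5"}
    print(classify_binding(rich, starved))
```

Because the result depends on which two conditions are compared, the same regulator can receive different labels for different condition pairs, in line with the observation that some regulators fall into multiple categories.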

Challenges for Future Drafts of the Regulatory Code 

We have used extensive in vivo binding data, conserved sequence information and prior 
knowledge of regulator-DNA interactions to construct a first draft of the transcriptional 
regulatory code of a eukaryotic cell. We anticipate that future revisions will be
facilitated by collecting more experimental data, by testing models that emerge from the 
data, and by developing improved computational algorithms to integrate various data 
types. It will be valuable to collect genome-wide binding data for DNA-binding 
regulators and chromatin regulators in cells grown under additional environments. 
Experimental tests of models for regulator functions predicted by their environment- 
dependent binding behaviour will provide new insights into the regulatory mechanisms 
involved in control of global gene expression programs. Knowledge of the 
environment-dependent changes in the abundance, modification state and intracellular 
compartmentalization of transcriptional regulators will also be valuable, although 
collecting this evidence will be challenging because transcriptional regulators are 
among the least abundant of cellular proteins. Frequent sampling of genome-wide
expression data obtained as cells are placed in the new environments will permit 
investigators to integrate binding and expression data to explain how dynamic changes 
in gene expression programs are regulated. New computational algorithms will play a 
key role in improving binding site sequence prediction and in allowing integration of
various data types to reveal how the transcriptional regulatory code controls gene 
expression programs under diverse conditions. 



Methods 

Strain Information 

Strains were created for each of the 203 regulators in which a repeated Myc
epitope coding sequence was integrated into the endogenous gene encoding the 
regulator. PCR constructs containing the Myc epitope coding sequence and a selectable 
marker flanked by regions of homology to either the 5' or 3' end of the targeted gene 
were transformed into the W303 yeast strain Z1256. Genomic integration and 
expression of the epitope-tagged protein were confirmed by PCR and Western blotting, 
respectively. 

Genome wide location analysis 

Genome-wide location analysis was performed as previously described. Bound 
proteins were formaldehyde-crosslinked to DNA in vivo, followed by cell lysis and 
sonication to shear DNA. Crosslinked material was immunoprecipitated with an anti- 
myc antibody, followed by reversal of the crosslinks to separate DNA from protein. 
Immunoprecipitated DNA and DNA from an unenriched sample were amplified and 
differentially fluorescently labeled by ligation-mediated PCR. These samples were 
hybridized to a microarray consisting of spotted PCR products representing the
intergenic regions of the S. cerevisiae genome. Relative intensities of spots were used 
as the basis for an error model that assigns a probability score (P value) to binding 
interactions. 

Growth environments 

We profiled all 203 regulators in rich medium. In addition, we profiled 85 
regulators in at least one other environmental condition. The list of regulators is given 
in Supplementary Table S1 and more information about the selection of these regulators
and the environmental conditions used can be found at the authors' website.

Motif discovery 

We used six methods to identify the specific sequences bound by regulators: 
AlignACE, MEME, MDscan, the method of Kellis et al. and two additional new 
methods that incorporate conservation data: MEME_c and CONVERGE. MEME_c 
uses the existing MEME program without change, but applies it to a modified set of 
sequences in which non-conserved bases were replaced with the letter "N". 
CONVERGE is a novel EM-based algorithm that takes a set of multiple sequence
alignments (MSA) as input instead of a set of sequences. Each MSA contains the 
available conservation information for a single probe across the sensu stricto species. 
Rather than searching for sites that are identical across multiple species, as is the case 
for MEME_c, CONVERGE searches for loci where all aligned sequences are consistent 
with the same specificity model. CONVERGE is described in greater detail in the 
supplemental material and will be published elsewhere. 
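
The masking step that distinguishes MEME_c from MEME can be illustrated with a minimal sketch. The function name `mask_nonconserved`, the use of gap-free pre-aligned sequences, and the rule that a base counts as conserved only when every aligned species carries the same base are assumptions made for illustration; the text states only that non-conserved bases were replaced with the letter "N".

```python
def mask_nonconserved(reference, orthologs):
    """Return `reference` with every non-conserved base replaced by 'N'.

    Assumptions (not from the paper): sequences are already aligned,
    gap-free and of equal length, and a column is conserved only if
    all orthologous sequences match the reference base.
    """
    for seq in orthologs:
        if len(seq) != len(reference):
            raise ValueError("aligned sequences must have equal length")
    masked = []
    for i, base in enumerate(reference):
        conserved = all(seq[i].upper() == base.upper() for seq in orthologs)
        masked.append(base.upper() if conserved else "N")
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical aligned fragments from three species.
    cerevisiae = "TGACTCATT"
    others = ["TGACTCAGT", "TGACTCATA"]
    print(mask_nonconserved(cerevisiae, others))  # TGACTCANN
```

The masked sequences would then be passed to an unmodified motif finder, which is the design choice the text attributes to MEME_c: conservation is encoded in the input rather than in the algorithm.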

We used XXX statistics to judge the significance of motifs A, B, C ...
(described in the supplementary material). To determine the appropriate thresholds for 
these measures, we applied each program to sets of randomly selected probes and 
calculated the empirical probability distribution for each program to find a motif with 
the given score. A motif was accepted if any one of these scores had a p-value < 0.001
when compared with the distribution of the same score observed in 50 randomization runs with
the same program. Significant motifs for the same factor were clustered and averaged. 
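
The acceptance rule can be sketched as follows. Because 50 randomization runs cannot resolve a p-value as small as 0.001 by counting alone, the sketch assumes that a normal distribution is fitted to the randomized scores; this is an assumption for illustration and may differ from the procedure in the supplementary material. The function names and the metric names `score_a` and `score_b` are hypothetical.

```python
import math
import statistics


def upper_tail_p(score, null_scores):
    """P value of `score` under a normal fit to the randomized scores.

    Assumption (not from the paper): null scores are roughly normal.
    """
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    if sigma == 0:
        raise ValueError("null scores show no variation")
    z = (score - mu) / sigma
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def accept_motif(scores, null_by_metric, alpha=0.001):
    """Accept a motif if any one of its scores is significant."""
    return any(
        upper_tail_p(value, null_by_metric[metric]) < alpha
        for metric, value in scores.items()
    )


if __name__ == "__main__":
    # Hypothetical scores from 50 runs on randomly selected probes.
    null = [10 + (i % 5) for i in range(50)]
    motif = {"score_a": 12.0, "score_b": 20.0}
    print(accept_motif(motif, {"score_a": null, "score_b": null}))
```

Accepting a motif when any one score passes increases sensitivity at the cost of a higher false-positive rate than the nominal threshold for a single score.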

Regulatory code 

Potential binding sites were included in the map of the regulatory code if they satisfied 
two criteria. First, a locus had to match the specificity model for a regulator in the 
Saccharomyces cerevisiae genome and at least two other sensu stricto Saccharomyces
genomes with a score >70% of the maximum possible. Second, the locus had to lie in 
an intergenic region that also contained a probe bound by the corresponding factor 
(p<0.001). 
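
The two inclusion criteria can be written as a small filter. This is a sketch under stated assumptions: the `Locus` record, its field names and the example species scores are hypothetical, and match scores are assumed to be already expressed as a fraction of the maximum possible score for the regulator's specificity model.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    # Species name -> match score as a fraction of the maximum (0 to 1).
    scores: dict = field(default_factory=dict)
    # Binding P value of a probe in the same intergenic region.
    probe_p_value: float = 1.0


def in_regulatory_code(locus, reference="S. cerevisiae",
                       min_score=0.70, min_other_species=2,
                       p_threshold=0.001):
    """Apply the conservation and binding criteria to one candidate locus."""
    # Criterion 1: the site matches in the reference genome...
    ref_ok = locus.scores.get(reference, 0.0) > min_score
    # ...and in at least two other sensu stricto genomes.
    others = sum(
        1 for species, score in locus.scores.items()
        if species != reference and score > min_score
    )
    # Criterion 2: the intergenic region contains a bound probe.
    bound = locus.probe_p_value < p_threshold
    return ref_ok and others >= min_other_species and bound


if __name__ == "__main__":
    candidate = Locus(
        scores={"S. cerevisiae": 0.90, "S. paradoxus": 0.80,
                "S. mikatae": 0.75, "S. bayanus": 0.50},
        probe_p_value=0.0005,
    )
    print(in_regulatory_code(candidate))
```

Requiring both conservation and in vivo binding trades coverage for confidence: a site that is bound but poorly conserved, or conserved but unbound, is left out of the map.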

Figure 1. Genome-wide Distribution of Transcriptional Regulators. Genome-wide
location analysis was used to determine the genomic occupancy of 203
DNA-binding transcriptional regulators in rich media conditions and, for 85 of 
these regulators, in at least one of twelve other environmental conditions. 
Additional information on experimental protocols, the regulators under study, 
raw location data, and data analysis methods are available in supplementary 
data and on the authors' website 48. The data selected for further analysis met a
0.001 P value threshold (Lee et al., 2002). a. Distribution of the number of 
promoter regions bound per regulator. For regulators profiled under multiple 
conditions, the union of promoter regions bound under all conditions is reported 
(blue). An average of randomized distributions for the same set of P values 
randomly assigned among regulators and promoter regions is shown in pink. b. 
Pairwise comparison of the number of promoter regions bound under two 
different conditions for 25 regulators. Dark blue bars represent the number of 
promoter regions bound under growth in rich medium; light blue bars represent 
the number of promoter regions bound under growth in amino acid starvation 
medium. 

Figure 2. Binding Site Sequences for Yeast Transcriptional Regulators. a.
Experimental procedure. Data from location analysis experiments were 
subjected to computational analyses to determine regulator binding site 
sequence specificities. These sequences were compared to and supplemented 
with published sequence specificities. Phylogenetic conservation information 
was used both to determine, in part, the binding sequences and in selecting the 
set of final sequences for inclusion in the map of the regulatory code. b. 
Specificity models that were "rediscovered" using our location data (left) and 
specificity models that were newly discovered (right). The height of the letter in each logo 
represents the frequency of the given nucleotide at that position within the 
collection of binding specificities. c. Conservation of bound motifs. The 
degree of conservation is shown for genomic sites that match a TRANSFAC matrix and 
were bound (red) or unbound (dark blue). The distribution of conservation 
values for all possible six-base pair sequences is shown for comparison (cyan). 

Figure 3. Yeast Transcriptional Regulatory Code. A genomic map of regulatory 
code sequences. Regions of Chromosomes II, IV and VII are shown including 
the locations of conserved DNA sequences bound in vivo by transcriptional 
regulators. Genes are shown as grey rectangles with arrows indicating 
direction of transcription; binding sequences are shown as coloured boxes. 

Figure 4. Yeast promoter architectures. Four classes of arrangements present 
in promoter regions are depicted. 

Figure 5. Environment-specific Utilization of Transcriptional Regulatory Code. 
Four patterns of genome-wide binding behaviour are depicted in a graphic 
representation on the left, where transcriptional regulators are represented by a 
coloured circle, a representative set of gene/promoters is represented above 
and below the regulator, and lines between the regulator and the 
gene/promoters represent binding events. Specific examples of the 
environment-dependent behaviours are depicted in the middle. Coloured 
circles represent regulators and coloured boxes represent DNA binding 
sequences present within promoter regions. 

Figures 8-39. Specific Embodiments of the methods described herein. 

Description of Supplementary Tables and Figures: 

Supplementary Table S1. List of all factors profiled and conditions used. 

Supplementary Table S2. List of all binding sequences used for creating 
the map. 

Supplementary Table S3. Table of Single Regulator MIPS annotations. 

Supplementary Table S4. List of regulators and genes with repetitive 
motif architecture. 

Supplementary Table S5. List of regulators and genes with co-occurring 
motif architecture. 

Supplementary Table S6. Classification of all regulators into binding 
behaviour categories. 

Supplementary Figure S1. Distance to ATG. 
Supplementary Figure S2. Gcn4 motifs. 

Supplementary Table S2 

Regulator  Discovered Specificity (1)  Known Specificity (1, 2)  Programs (3) 

Abf1     rTCAyt....Acg rTCAyT....ACGw A, C, D, K, M, N 
Ace2     tGCTGGT GCTGGT K 
Adr1     GGrGk 
Aft2     GGGTGy "ATCTTCAAAAGTGCACCCAT. . . 
         ...TTGCAGGTGC" A, C, D, M, N 
Arr1     TTACTAA 
Ash1     yTGACT 
Azf1     YwTTkcKkTyyckgykky AAGAAAAA N 
Bas1     TGACTC A, K, M, N 
Cad1     mTTAsTmAkC TTACTAA A, C, D, M, N 
Cbf1     tCACGTG rTCACrTGA A, C, D, K, M, N 
Cin5     TTAygTAA TTACTAA A, C, D 
Crz1     GwGGCTG 
Dal80    GATAA GATAAG 
Dal81    AAAAGCCGCGGGCGGGATT 
Dal82    GATAAGa "GCTGAAAGTTGCGGTGCGATA. . . 
         ...GAATACCGCGGATTTTGGAA" K, D 
Dig1     TgAAAca A, C, D, K, M, N 
Ecm22    CTCGTATAAGC 
Fhl1     rTGTayGGrtg A, C, D, K, M, N 
Fkh1     tTgTTTac GGTAAACAA A, C, D, K, M, N 
Fkh2     aaa. GTAAAC Aa GGTAAACAA A, C, D, K, M, N 
Gal4     CGG cCg CGG CCG A, K 
Gal80    CGG CCG 
Gat1     aGATAAG GATAA K 
Gcn4     TGAsTCa ArTGACTCw A, C, D, K, M, N 
Gcr1     GGCTTCCwC 
Gln3     GATAAGa. a GATAAGATAAG C, D, K 
Gzf3     GATAAG GATAA 
Hac1     kGmCAGCGTGTC 
Hap1     GGmraTA.CGs CGG...TA.CGG C, M 
Hap2     CCAAT ayc.ccaat.a.m 
Hap3     CCAAT ayc..ccaat.a.m 
Hap4     g.CcAAtcA ayc.ccaat.a.m A, C, D, M, N 
Hap5     ~ CCAAT 
Hsf1     TTCya TTC AGAA. .TTCTAGAA A, C, D, K, M, N 
Ime1     AAkGAAA.kwA A 
Ino2     CAcaTGc GATGTGAAAT C, D, M, N 
Ino4     CATGTGaaaa CATGTGAAAT A, C, D, K, M, N 
Leu3     cCGgtacCGG yGCCGGTACCGGyk A, D, K, M, 
Mac1     GAGCAAA 
Mbp1     rACGCGt ACGCGT A, C, D, K, M, N 
Mcm1     tttCC.rAt..gg wTTCCyAAw..GGTAA A, C, D, M, N 
Met31    AAACTGTGG 
Met32    AAACTGTGG 
Met4     RMmAwsTGKSgyGsc C 
Mot3     yAGGyA 
Msn2     mAGGGGsgg mAGGGG M 
Msn4     mAGGGG 
Ndd1     tt.CC.rAw..GG A, D 
Nrg1     GGaCCCT TCCCTCATTTC A, C, D, M, N 
Opi1     TCGAAyC 
Pdr1     ccGCCgRAwra CCGCGG M 
Pdr3     TCCGCGGA 
Phd1     sc.GC.gg A, D, N 
Pho2     SGTGCGsygyG N 
Pho4     CACGTGs cacgtk.g K, D, N 
Put3     CGG CCG 
Rap1     tGyayGGrtg wrmACCCATACAyy A, C, D, M, N 
Rcs1     ggGTGca.t AmTGC ACCC AkTT C, D, M, N 
Rds1     kCGGCCGa D, N 
Reb1     CGGGTAA TTACCCGG A, C, D, K, M, N 
Rfx1     TTgccATggCAAC D 
Rgt1     CGGA..A 
Rim101   TGCCAAG 
Rlm1     CTAwwwwTAG 
Rlr1     ATTTTCttCwTt N 
Rox1     ysyATTGTT 
Rph1     CCCCTTAAGG AGGGG 
Rpn4     TTTGCCACCGGTGGCAAA A, C, D, K, M, N 
Rtg1     GGTCAC 
Sfl1     GAAGCTTC 
Sfp1     ayCcrtACay A, C, D, M, N 
Sig1     ArGmAwCrAmAA M 
Sip4     CGG.y.AATGGrr yCGGAyrrAwGG D 
Skn7     G.C.GsCs ATTTGGCyGGsCC A, C, D, M, N 
Sko1     ACGTCA 
Smp1     ACTACTAwwwwTAG 
Snt2     yGGCGCTAyca A, C, D, M, N 
Sok2     tGCAg..a A 
Spt2     ymtGTmTytAw M 
Spt23    rAAATsaA C 
Stb1     rracGCsAaa C, D, K, M, N 
Stb4     TCGg..CGA K 
Stb5     CGGwstTAta CCG D, N 
Ste12    tgAAACa ATGAAAC A, C, D, K, M, N 
Stp1     rCGGC.rCGGC 
Sum1     gyGwCAswaaw AGyGwC ACAAAAk A, C, D, M, N 
Sut1     gcsGsg..sG A, D, M 
Swi4     raCgCsAAA CCG AAA A, C, D, K, M, N 
Swi5     kGCTGr 
Swi6     tttcGCGt C.CGAAA A, C, D, M, N 
Tec1     rrGAATG CATTCy 
Thi2     gmAAcy.twAgA C, D 
Tye7     tCACGTGAy CA..TG A, C, D, M 
Uga3     CCG....CGG 
Ume6     taGCCGCCsa wGCCGCCGw A, C, D, K, M, N 
Xbp1     CTTCGAG 
Yap1     TTaGTmAGc TGAsTCAG A, C, D, M 
Yap3     TTACTAA 
Yap5     TTACTAA 
Yap6     TTACTAA 
Yap7     mTkAsTmAk TTACTAA A, C, D, M, N 
Ydr026C  ttTACCCGGm C, D, M, N 
Yhp1     TAATTG 
Yox1     AsAATA.TGAmr yAATTA 
Zap1     ACCCTmAAGGTyrT ACCCTAAAGGT 

(1) Ambiguity codes: S = CG, W = AT, R = AG, Y = CT, K = GT, M = AC, "." = 
ACGT. Letter capitalization reflects the information content at each position of the 
motif. 

(2) Known specificities are taken from the YPD, SCPD, and TRANSFAC databases. 

(3) Program codes: A = AlignACE, C = CONVERGE, D = MDscan, K = Kellis et al., M 
= MEME, N = MEME (consensus genome) 
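
To illustrate how the ambiguity codes in this table are read, the following minimal sketch expands a consensus string into a regular expression and scans a sequence for exact matches. It is an illustration only: the function names are hypothetical, letter case is ignored (so the information-content distinction carried by capitalization is discarded), only the given strand is scanned, and exact consensus matching is a simplification of the score-based matching against specificity models described in the methods.

```python
import re

# Ambiguity codes as defined in the table footnote.
AMBIGUITY = {
    "S": "CG", "W": "AT", "R": "AG", "Y": "CT",
    "K": "GT", "M": "AC", ".": "ACGT",
}


def motif_to_regex(motif):
    """Expand a consensus string into a compiled regular expression."""
    parts = []
    for symbol in motif:
        upper = symbol.upper()
        if upper in ("A", "C", "G", "T"):
            parts.append(upper)
        elif upper in AMBIGUITY:
            parts.append("[" + AMBIGUITY[upper] + "]")
        else:
            raise ValueError("unrecognized motif symbol: %r" % symbol)
    return re.compile("".join(parts))


def find_sites(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping)
    exact matches of the motif on the given strand."""
    pattern = motif_to_regex(motif)
    seq = sequence.upper()
    return [i for i in range(len(seq)) if pattern.match(seq, i)]
```

For example, the discovered Gcn4 specificity TGAsTCa expands to the pattern TGA[CG]TCA under these assumptions.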

Supplementary Table S3 

Regulator   Environment        Functional category 

Cbf1        SM                 amino acid metabolism 
Dig1        But                morphogenesis 
Fhl1        YPD, SM, Rap,      protein biosynthesis 
Fkh1        YPD                cell cycle 
Gal4        Gal, Raf           carbohydrate metabolism 
Gcn4        YPD, SM, Rap       amino acid metabolism 
Gln3        YPD                amino acid metabolism 
Hap1        YPD                electron transport 
Hap2        Rap                cellular respiration 
Hap4        YPD                cellular respiration 
Hsf1        H2O2Hi, H2O2Lo     response to stress 
Ino4        YPD                lipid metabolism 
Leu3        YPD                amino acid metabolism 
Mbp1        H2O2Hi, H2O2Lo     DNA metabolism 
Mcm1        Alpha              cell cycle 
Msn2        H2O2Lo             carbohydrate metabolism 
Rcs1        H2O2Hi, H2O2Lo     transport 
Rds1        H2O2Hi             carbohydrate metabolism 
Reb1        H2O2Lo             vesicle-mediated transport 
Rpn4        H2O2Hi, H2O2Lo     protein catabolism 
Sok2        But                meiosis 
Ste12       YPD, Alpha, But,   conjugation 
Sum1        YPD                sporulation 
Tec1        But                cell wall organization 
Yap6        H2O2Lo             response to stress 

Supplementary Table S5 

Regulator 1  Regulator 2  P value  Bound by Regulator 1  Bound by Regulator 2  Bound by both 

ACE2         SWI5         0        11                    23                    5 
AFT2         RCS1         0        41                    19                    11 
ARR1         YAP3         0.001    1                     1                     1 
AZF1         GZF3         0.002    1                     3                     1 
BAS1         MET4         0.002    18                    5                     2 
CAD1         YAP1         0        9                     12                    4 
CAD1         YAP7         0        9                     51                    9 
CBF1         MET31        0.002    102                   4                     3 
CBF1         MET32        0        102                   10                    6 
CBF1         MET4         0.004    102                   5                     3 
CBF1         PHO4         0.001    102                   15                    6 
CBF1         TYE7         0        102                   24                    17 
CIN5         PHD1         0        54                    30                    8 
CIN5         SKN7         0        54                    57                    9 
CIN5         SOK2         0        54                    34                    9 
CIN5         SUT1         0        54                    26                    6 
CIN5         XBP1         0.002    54                    2                     2 
CIN5         YAP6         0.005    54                    3                     2 
DAL82        GAT1         0        20                    7                     3 
DAL82        GLN3         0        20                    30                    7 
DAL82        HAP2         0.002    20                    18                    3 
DIG1         MCM1         0        63                    42                    9 
DIG1         STE12        0        63                    87                    49 
DIG1         SWI4         0        63                    53                    14 
DIG1         SWI6         0        63                    71                    12 
DIG1         TEC1         0        63                    32                    16 
FHL1         RAP1         0        61                    54                    30 
FHL1         SFP1         0        61                    15                    13 
FKH1         FKH2         0        62                    52                    20 
FKH2         MCM1         0        52                    42                    12 
FKH2         NDD1         0        52                    24                    14 
FKH2         SWI6         0        52                    71                    12 
GCN4         GLN3         0.004    93                    30                    7 
GCN4         LEU3         0.002    93                    9                     4 
GCR1         TYE7         0.002    4                     24                    2 
GLN3         HAP2         0        30                    18                    5 


GZF3 
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